Tables to LaTeX: structure and content extraction from scientific tables

Abstract

Scientific documents contain tables that list important information in a concise fashion. Structure and content extraction from tables embedded within PDF research documents is a very challenging task due to the existence of visual features like spanning cells and content features like mathematical symbols and equations. Most existing table structure identification methods tend to ignore these academic writing features. In this paper, we adapt the transformer-based language modeling paradigm for scientific table structure and content extraction. Specifically, the proposed model converts a tabular image to its corresponding LaTeX source code. Overall, we outperform the current state-of-the-art baselines and achieve an exact match accuracy of 70.35% and 49.69% on table structure and content extraction, respectively. Further analysis demonstrates that the proposed models efficiently identify the number of rows and columns, the alphanumeric characters, the LaTeX tokens, and symbols.
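
The exact match accuracy reported in the abstract can be illustrated with a small sketch. It assumes that a prediction counts as correct only when its token sequence is identical to the reference sequence. The function name `exact_match_accuracy`, the sample token lists, and the tokenization are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def exact_match_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Sequence[str]],
) -> float:
    """Return the fraction of predictions identical to their reference.

    Assumption: a table is counted as correct only if every token matches,
    in order, with no extra or missing tokens.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        1 for pred, ref in zip(predictions, references) if list(pred) == list(ref)
    )
    return matches / len(references)


# Hypothetical tokenized LaTeX sources for two small tables.
references = [
    ["\\begin{tabular}", "{", "c", "c", "}", "a", "&", "b", "\\\\", "\\end{tabular}"],
    ["\\begin{tabular}", "{", "c", "}", "x", "\\\\", "\\end{tabular}"],
]
predictions = [
    list(references[0]),  # identical to the reference: counts as a match
    ["\\begin{tabular}", "{", "c", "}", "y", "\\\\", "\\end{tabular}"],  # one wrong cell
]

accuracy = exact_match_accuracy(predictions, references)
print(f"exact match accuracy: {accuracy:.2%}")
```

Under this assumption a single wrong token makes the whole table count as a miss, which is one reason exact match scores tend to be lower for content extraction than for structure extraction.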

Similar resources

Semi-automatic Data Extraction from Tables

This paper describes a novel approach to automate extraction of useful information from tables and to record the knowledge procured in a structured data repository. The approach is based on modeling a behavior of an expert, who collects tabular data and maps them to a predefined relational schema. Experimental results demonstrate that the proposed approach predicts expert decisions with high ac...

From Tables to Frames

Turning the current Web into a Semantic Web requires automatic approaches for annotation of existing data since manual approaches will not scale in general. We here present an approach for automatic generation of frames out of tables which subsequently supports the automatic population of ontologies from table-like structures. The approach consists of a methodology, an accompanying implementati...

Disentangling the Structure of Tables in Scientific Literature

Within the scientific literature, tables are commonly used to present factual and statistical information in a compact way that is easy for readers to digest. The ability to "understand" the structure of tables is key for information extraction in many domains. However, the complexity and variety of presentation layouts and value formats make it difficult to automatically extract roles and re...

Plain Answers to Several Questions about Association/Independence Structure in Complete/Incomplete Contingency Tables

In this paper, we develop some results based on the relational model (Klimova et al. 2012), which permits a decomposition of the logarithm of expected cell frequencies under a log-linear type model. These results imply plain answers to several questions in the context of analyzing contingency tables. Moreover, determination of the design matrix and hypothesis-induced matrix of the model will be discusse...

Journal

Journal title: International Journal on Document Analysis and Recognition

Year: 2022

ISSN: 1433-2833, 1433-2825

DOI: https://doi.org/10.1007/s10032-022-00420-9